DATA COMPRESSION SYSTEM BASED ON TREE MODELS

This application is a continuation-in-part of co-pending U.S. Patent Application
10/768,904, entitled "FSM Closure of Generalized Tree Models," inventors Alvaro
Martin, et al., filed January 29, 2004, the entirety of which is incorporated herein by
reference.

BACKGROUND OF THE INVENTION 
Field of the Invention 

The present invention relates generally to the field of data compression, and more
specifically to linear time universal coding for the class of tree models.

Description of the Related Art 

Information sources emit symbols from a given alphabet according to some
probability distribution. In particular, finite memory sources employ a finite number of
contiguous past observations to determine the conditional probability of the next emitted
symbol. In many instances employing conditional probability prediction, the memory
length, i.e. the number of past symbols that determine the probability distribution of the
next one, depends on the data received and can vary from location to location. Due to
this variance in memory length, a Markov model of some order m fit to the data is
generally not efficient in determining conditional probability for next emitted symbols. In
such a Markov model, the number of states grows exponentially with m, yielding a
complex model that includes many equivalent states with identical conditional
probabilities. In general, when considering a Markov model, removing redundant
parameters and reducing the total number of states can provide enhanced overall
performance.

Reduced Markov models of information sources have been termed "tree models,"
as they can be graphically represented using a simple tree structure. A "tree model"
includes an underlying full α-ary tree structure and a set of conditional probability
distributions on the alphabet, one associated with each leaf of the tree, where each leaf
corresponds to a "state." An α-ary tree structure includes, for example, binary trees,
ternary trees, and so forth, where α is the size of the source alphabet A. The model
associates probabilities to sequences as a product of symbol probabilities conditioned on
the states. The appeal of tree models is the ability to capture redundancies typical of real
life data, such as text or images, while at the same time providing the ability to be
optimally estimated using known algorithms, including but not limited to the Context
algorithm. Tree models have been widely used for data modeling in data compression,
but are also useful in data processing applications requiring a statistical model of the data,
such as prediction, filtering, and denoising.

The use of statistical models, such as tree models, in lossless data compression is
facilitated by arithmetic codes. Given a sequence x of length n over the alphabet A, and a
statistical model that assigns a probability P(x) to the sequence x, an arithmetic encoder
can efficiently assign a codeword (for example, over the binary alphabet {0,1}) of length
slightly larger than, but as close as desired to, the smallest integer not smaller than
log(1/P(x)), where the logarithm is taken in base 2. A corresponding decoder can decode
the codeword to recover the sequence x. For the code to be effective, the goal is to make
the code length as short as possible, and lossless data compression requires exact
recovery of the original sequence x by the decoder. A universal lossless data
compression system aims to assign to every sequence x a codeword of length that
approaches, as the length n of the sequence grows, the length assigned by the best
statistical model in a given "universe" or class of models. When the statistical models are
determined by K free parameters, for most sequences x this target can only be achieved
up to an excess code length of (K log n)/(2n) + O(K/n) bits per input symbol. While
increasing the dimensionality K of the class of models decreases the target code length,
the unavoidable excess code length over this target code length increases. An optimal
lossless data compression system aims at finding the best trade-off value for the number
of parameters K.
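
The two quantities discussed above can be illustrated numerically. The following is a
minimal Python sketch, not part of the disclosed design; the function names
ideal_code_length and model_cost_per_symbol are hypothetical, and only the leading term
(K log n)/(2n) of the excess code length is computed, the O(K/n) term being ignored.

```python
import math


def ideal_code_length(p_x: float) -> int:
    """Smallest integer not smaller than log2(1/P(x)), in bits."""
    return math.ceil(-math.log2(p_x))


def model_cost_per_symbol(k: int, n: int) -> float:
    """Leading term (K log n)/(2n) of the excess code length, in bits per symbol."""
    return k * math.log2(n) / (2 * n)


if __name__ == "__main__":
    # A sequence to which the model assigns probability 2^-10 needs about 10 bits.
    print(ideal_code_length(2 ** -10))  # 10
    # For a fixed length n, the excess grows with the number of parameters K.
    for k in (1, 4, 16):
        print(k, model_cost_per_symbol(k, 1_000_000))
```

The loop shows the trade-off described in the text: a richer model class may lower the
target code length, but each additional free parameter adds to the unavoidable excess.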



For the class of tree models of any size, this optimal trade-off is achieved by
codes such as CTW and Context. Any given tree determines a class of tree models with a
number of free parameters K given by the number of its states times α − 1, where α is the
alphabet size, since α − 1 free parameters per state determine each conditional
distribution. For any tree having K free parameters and any sequence of length n, CTW
and Context provide, without prior knowledge of the tree or K, a normalized excess code
length of at most (K log n)/(2n) + O(K/n) bits over the shortest code length assigned by
the best tree model supported by the tree. In the "semi-predictive" variant of Context, a
system seeks to estimate a best tree model, and describes the corresponding best tree to a
decoder in a first pass. Determination of the best tree takes into account the number of
bits needed to describe the tree itself, and a code length based on model parameters that
are sequentially estimated by the encoder. The system sequentially encodes data based on
the described tree in a second pass, using the model parameters estimated sequentially.
Therefore, the parameters dynamically change during the encoding process. Given the
tree, the decoder can mimic the same estimation scheme and therefore need not be
explicitly informed of the parameters by the encoder. Such a design is suggested in, for
example, J. Rissanen, "Stochastic complexity and modeling," Annals of Statistics, vol. 14,
pp. 1080-1100, September 1986. Determination of the best tree model requires "pruning"
a tree, called a context tree, containing information on all occurrences of each symbol in
every context. The second pass encoding entails assigning a conditional probability to
each symbol sequentially based on previous occurrences of symbols in its context, and
encoding the symbol using an arithmetic code. The decoder can reverse the second pass
encoding operations.

One problem with using tree models is the cost associated with transitioning from
one state to the next state. In principle, for a general tree model, knowledge of the
current state and the next input symbol might not be sufficient to determine the next state.
Determination of the latter generally entails traversing the tree from its root, and
following branches according to the sequence of symbols preceding the current symbol.
For general trees, such a procedure typically requires a number of steps that cannot be
bounded by a constant. Thus, transitioning from one state to another is generally
expensive from a computational perspective, and use of such trees can add complexity to
the system.

Another computational problem with using tree models is that efficient
implementations thereof require that the data collection to build the context tree be done
in a compact suffix tree of the encoded sequence. Since compact trees need not be full,
and their edges may be labeled by strings of length greater than one, the tree
corresponding to the optimal tree model for a given sequence will generally not be a
sub-tree of the suffix tree of the sequence, as it may contain paths that do not occur in the
original sequence but were added to make the tree full. This phenomenon complicates the
pruning process.

On the other hand, certain popular data compression algorithms, such as PPM and
algorithms based on the Burrows-Wheeler transform, or BWT, are also based on tree
models, but do not achieve the optimal redundancy attained by the CTW and Context
methods.

Based on the foregoing, it would be advantageous to offer a relatively simple 
coding method, which is relatively optimal for the class of tree models, using trees or 
tree structures in an efficient manner. 

SUMMARY OF THE INVENTION

According to a first aspect of the present design, there is provided a method for
encoding a sequence into a concatenated string. The method comprises building a suffix
tree of the sequence in reverse order, pruning the suffix tree to form a generalized context
tree (GCT) having a plurality of states, obtaining a binary representation of a full tree
derived from the GCT, encoding the sequence into a binary string using a dynamic tree
model based on statistics collected at the plurality of states of the GCT, and
concatenating the binary representation of the full tree with the binary string to form the
concatenated string.

According to a second aspect of the present design, there is provided a method for
encoding a sequence into a concatenated string. The method comprises building a suffix
tree of the sequence in reverse order, pruning the suffix tree to form a generalized context
tree (GCT) having a plurality of states, building a finite state machine (FSM) closure of
the GCT to form an FSM closed GCT, obtaining a binary representation of a full tree
derived from the GCT, encoding the sequence into a binary string using a dynamic tree
model based on statistics collected at the plurality of states of the GCT, transitioning to a
next state of the GCT with the FSM closed GCT, and concatenating the binary
representation of the full tree with the binary string to form the concatenated string.

According to a third aspect of the present design, there is provided a method for
decoding a binary string, the binary string comprising a binary representation of a full
tree having a plurality of states associated therewith, and an encoded string produced by a
corresponding encoder using a dynamic tree model based on the full tree. The method
comprises building a finite state machine (FSM) closure of the full tree, iteratively
decoding at least one symbol using the dynamic tree model of the corresponding encoder
based on statistics collected at the plurality of states of the full tree, and transitioning to
the next state of the full tree using the FSM closed full tree.

According to a fourth aspect of the present design, there is provided a method for
decoding a binary string, the binary string comprising a binary representation of a
generalized context tree (GCT) and an encoded string produced by a corresponding
encoder using a dynamic tree model based on the GCT, the GCT having a plurality of
states associated therewith. The method comprises building a decoding GCT based on
the binary representation of the GCT, building a finite state machine (FSM) closure of the
decoding GCT, iteratively decoding at least one symbol using the dynamic tree model of
the corresponding encoder based on statistics collected at the plurality of states of the
decoding GCT, and transitioning to a next state of the decoding GCT using the FSM
closed decoding GCT.

According to a fifth aspect of the present design, there is provided a method for
decoding a binary string, the binary string comprising a binary representation of a full
tree and an encoded string produced by a corresponding encoder using a dynamic tree
model based on the full tree, the full tree having a plurality of states associated therewith.
The method comprises building a decoding full tree based on the binary representation of
the full tree, creating a reduced generalized context tree (GCT) and mapping the reduced
GCT to the decoding full tree, building a finite state machine (FSM) closure of the
reduced GCT, iteratively decoding at least one symbol using the dynamic tree model of
the corresponding encoder based on statistics collected at the plurality of states of the
decoding full tree, and transitioning to a next state of the decoding full tree using the
FSM closed reduced GCT.

According to a sixth aspect of the present design, there is provided a method for
decoding a binary string, the binary string comprising a binary representation of a full
tree and an encoded string produced by a corresponding encoder using a dynamic tree
model based on the full tree, the full tree having a plurality of states associated therewith.
The method comprises building a decoding full tree based on the binary representation of
the full tree, creating a reduced generalized context tree (GCT), building a finite state
machine (FSM) closure of the reduced GCT, iteratively decoding at least one symbol
using the dynamic tree model of the corresponding encoder based on statistics collected
at the plurality of states of the decoding full tree, transitioning to a next state of the
decoding full tree using the FSM closed reduced GCT, and adding encountered states of
the decoding full tree and suffixes thereof to the FSM closure of the reduced GCT.

According to a seventh aspect of the present design, there is provided a method
for decoding a binary string ty, formed by concatenating binary strings t and y, into a
resultant string x, the binary string ty comprising a binary representation t of a tree T and
an encoded string y produced by a corresponding encoder using a dynamic tree model
based on the tree T. The method comprises building the tree T based on the binary
representation t, setting the resultant string x to an empty string, iteratively decoding at
least one symbol using the dynamic tree model of the corresponding encoder based on
statistics collected at a state given by a longest ancestor of the reversed resultant string x
originally in T and filling the resultant string x with decoded symbols, and inserting the
reversed resultant string x in T.

These and other objects and advantages of all aspects of the present invention will
become apparent to those skilled in the art after having read the following detailed
disclosure of the preferred embodiments illustrated in the following drawings.

DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of 
limitation, in the figures of the accompanying drawings in which: 

FIG. 1 illustrates a basic encoder-decoder design for a source, such as an
information source;

FIG. 2A represents a binary tree structure; 

FIG. 2B presents two trees for the string aeceaceae; 

FIG. 3 illustrates a string x having a prefix and suffix; 

FIG. 4 is a binary context tree; 

FIG. 5 shows an FSM closure of the non-FSM closed binary context tree of FIG.
4, including two new leaves;

FIG. 6 shows the finite state machine associated with the tree T_F of FIG. 5;

FIG. 7 shows a graphical representation of operation of an encoder according to 
an embodiment of the present design; 

FIG. 8 is a graphical representation of one embodiment of the decoder using the
concept of an incomplete FSM closure;

FIG. 9 illustrates a graphical representation of an additional embodiment of the 
decoder using the concept of an incremental FSM closure construction; 

FIG. 10 shows a graphical representation of a further embodiment of the decoder
using the concept of incremental suffix tree construction;

FIG. 11 illustrates the concept of a Jump structure and creation thereof;



FIG. 12 shows the processing of the decoder according to the concept of an
incomplete FSM closure;

FIG. 13 shows the processing of the decoder according to the concept of an 
incremental FSM closure construction; and 

FIG. 14 is the processing of the decoder according to the concept of incremental 
suffix tree construction. 

DETAILED DESCRIPTION OF THE INVENTION



The present design operates by taking a sequence, constructing a Generalized
Context Tree (GCT) that models a source that could have emitted the sequence, and
optionally refining the GCT by adding leaves and/or internal nodes, where necessary,
such that the refined GCT has a finite state machine (FSM) property. Such construction
is referred to as computing an "FSM closure" on the GCT, thereby forming a resultant
tree, and is described in detail in U.S. Patent Application 10/768,904, entitled "FSM
Closure of Generalized Tree Models," inventors Alvaro Martin et al., the entirety of
which is incorporated herein by reference. Intermediate trees may be formed in the
process, such as when filling the GCT with leaves and/or internal nodes. The present
design may alternately be considered to receive a string, build a suffix tree of the string in
reverse order, prune the suffix tree to form a pruned tree, and build an FSM closure of the
pruned tree to form an FSM closed tree. The FSM closed tree is used by the present
system to sequentially assign probabilities to the symbols in the input sequence, with the
purpose of, e.g., encoding the sequence. The present system provides information about
the pruned tree to a decoder, which can reconstruct the FSM closure and utilize the tree in
various ways to decode the sequence. Tree construction, encoding, and reconstruction
processes may operate in a time frame linear in the length of the input string. The present
design discloses the manner in which the sequence may be encoded and various ways in
which the resultant bitstream may be decoded.

Definitions 

As used herein, the terms "algorithm," "program," "routine," and "subroutine"
will be generally used interchangeably to mean the execution functionality of the present
design. The term "subroutine" is generally intended to mean a sub program or ancillary
algorithm, called from the main program, that may be associated with or subordinate to
the main program or algorithm.

Also as used herein, A represents an alphabet of α available symbols, α being
greater than or equal to 2. The values A*, A^+, and A^m denote, respectively, the set of
finite strings, the set of positive length strings, and the set of strings of length m, where m
is greater than 0, over the set of symbols. Variables a, b, and c represent symbols from
alphabet A, while r, s, t, u, v, w, x, y, and z represent strings in A*. The notation x_i is
used to denote the i-th symbol of x, while x^i denotes the substring x_1 x_2 ... x_i. The
reverse of a string x is x̄, equal to x_k x_{k-1} ... x_1, where k is the length of x. Length
of a string x is represented as |x|. The null string, a string of length zero, is denoted λ.
"uv" is the concatenation of strings u and v.

Further, as used herein, the terms "prefix" and "suffix" are illustrated by, for
example, a string t equal to uvw, where u, v, and w are also strings. In this case, u is a
"prefix" of t, v is a "t-word," and w is a "suffix" of t. The phrase "u is a prefix of v" is
written as "u ⪯ v." If u is a prefix of v and |u| is less than |v|, u is said to be a "proper
prefix" of v. An analogous definition applies to "proper suffix". For a string u, head(u) is
the first symbol of u, and tail(u), also known as the suffix tail, is its longest proper suffix.

A typical binary tree structure is illustrated in FIG. 2A for purposes of identifying
the terminology used herein. Tree structure 200 includes a set of "nodes" such as node
201 or node 202. Nodes are joined by "edges," such as edge 203. Edges are assumed to
be directed, or have a direction associated therewith. In the example binary tree structure
of FIG. 2A, and in the other illustrations of this application, edges are directed from top
to bottom. If an edge originates at node x and ends at node y, x is the "parent" of y, and y
is a "child" of x. Each node has a unique parent, except for one distinguished node
referred to as the "root." In FIG. 2A, node 210 is the root, and node 201 is the parent of
nodes 202 and 212, which are the children of node 201. A "leaf" is a node with no
children, such as node 202. An "internal node" is any node, such as node 201, that is not
a leaf.

Each edge in the tree is labeled with a string from A^+, such as string "11" in edge
204, or string "1" in edge 203. Edges departing from a node are typically labeled with
strings starting with different symbols, and each node has at most as many children as the
size α of the alphabet. An edge is "atomic" if it is labeled by a single-symbol string,
such as edge 203 in FIG. 2A. An edge that is not atomic is called "composite," such as
edge 204. The term "atomic tree" designates a tree where every edge in the tree is
atomic. Every node in a tree is associated with a string, composed by concatenating the
labels of all the edges in the path from the root to the node. For example, in FIG. 2A,
node 213 is associated with the string "10111," and node 214 is associated with the string
"111." Nodes are identified herein by their associated strings. For instance, if u is a
string, the node whose associated string equals u will be simply referred to as "node u".
Also, all operations defined over strings may be applied to nodes with the understanding
that the operations are applied to the associated strings. For example, if v is a node, |v|
denotes the length of the string associated with node v.

A node is called a "branching node" if it has at least two children. A tree T is
"compact" if every node in T is either the root, a leaf, or a branching node. A tree is
"full" if the tree is atomic and the number of branches emanating from every node is
either zero or α, where α is the size of the alphabet A. In the case of a binary tree, for
example, α is 2, and a full tree has two branches emanating from every internal node,
with no branches emanating from any leaf. FIG. 4 illustrates a full binary tree.

Consider a string x_1 x_2 ... x_n, and its substring x_1 x_2 ... x_i, with i less than n,
and a full tree T. Starting at the root, and following branches by matching their labels to
symbols from the reversed substring x_i x_{i-1} ... x_1, one eventually reaches a leaf of T,
provided the number i is large enough (e.g., larger than the length of the longest string
associated with a leaf of T). That leaf is referred to as the "state" determined by the string
x_1 x_2 ... x_i, which is also the state in which symbol x_{i+1} is processed in data
processing applications using the tree T. For example, for the tree T of FIG. 4, the state
determined by string "010111" is leaf 402. Full trees used for determining states are
termed "context trees" as the state used for processing x_{i+1} corresponds to a sub-string
of x preceding x_{i+1}, known as a "context" for that occurrence of the symbol in x.
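
The traversal just described can be sketched in a few lines of Python. This is an
illustrative sketch only: the function name context_state is hypothetical, the tree is
represented simply by the set of strings associated with its leaves, and the leaf strings
used for the tree of FIG. 4 are an assumption taken from the state set {0, 100, 101, 11}
quoted for that figure elsewhere in this description.

```python
def context_state(leaves, past):
    """Return the leaf of a full context tree selected by the string `past`.

    The tree is walked from the root along the reversed past (most recent
    symbol first). `leaves` holds the strings associated with the leaves.
    Returns None when the past is too short to reach a leaf.
    """
    reversed_past = past[::-1]
    for depth in range(len(reversed_past) + 1):
        candidate = reversed_past[:depth]
        if candidate in leaves:
            return candidate
    return None


# Assumed leaf strings for the full binary tree of FIG. 4.
FIG4_LEAVES = {"0", "100", "101", "11"}

if __name__ == "__main__":
    print(context_state(FIG4_LEAVES, "010111"))  # 11
```

For the string "010111" the walk follows the symbols 1, 1 of the reversed string and
stops at the leaf whose associated string is "11"; the sketch does not depend on which
reference numeral the figure assigns to that leaf.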

When a tree T is not full or when it is not atomic, nodes that are not leaves may
determine states. In general, for sufficiently large i, the state determined by x^i is the last
node of the tree visited while traversing the tree as described above, before "falling off"
the tree. For example, for the tree of FIG. 2A, the state determined by string "1010011" is
node 205. In this case, the tree is termed a Generalized Context Tree, or GCT.

A tree can be considered a set of strings, namely the set of strings associated with
the tree's nodes and all their prefixes. Each string belonging to the set of strings
represented by a tree T is said to be a word of T and the set may be denoted WORDS(T).

As used herein, the term "suffix tree" is used interchangeably with the term
"compact suffix tree". The suffix tree or compact suffix tree of a string t refers to a
compact representation of a tree T such that WORDS(T) equals the set of all t-words.

FIG. 1 illustrates a simplified version of an arrangement wherein the present
design may be employed. Encoder 101 encodes the symbol stream received from a
source 103, such as an information source, and may contain the algorithm disclosed
herein as well as the hardware on which the algorithm operates. Alternately, a third
location (not shown) may be employed to operate the algorithm and transmit the
optimized tree structure(s) to the encoder 101 and decoder 102. Decoder 102 receives the
tree structure and thus the states computed by the algorithm, as well as the encoded series
of symbols, and decodes the symbols and reassembles the string. In a typical
environment, the medium for transmission may be over the air, over wire, or any other
medium known for transmission of signals.

Generalized Context Trees, Finite State Machines, and FSM Closure 

Generalized Context Trees and Finite State Machines are two different ways of
assigning a unique state from a finite set to any string x^k of A*. In the case of GCTs, the
state is determined, for sufficiently long strings, by the last node visited while traversing
the tree from the root following the path determined by the reverse of x^k, before "falling
off" the tree. More formally, for a GCT T and arbitrary string y, the canonical
decomposition of y with respect to T is C_T(y) = (r, u, v), where r is the longest prefix of
y that is a node of T, ru is the longest prefix of y that is a word of T, and y equals ruv. The
first component of C_T(y), namely r, is denoted V_T(y).

As shown in FIG. 2B, a canonical decomposition follows the path defined by y
starting at the root and proceeds down the tree T by matching symbols on its edge labels.
Here r is the last node visited, and v is the suffix of y starting at the mismatch point, or
the part of y that falls off the tree. From FIG. 2B, assume y is the string aeceaecae. For
the upper tree 250, beginning with the root, progression moves forward to node a and
symbols e, c, e, a before falling off. In this case, r equals node a, u equals "ecea" and v is
"ecae". For the non-compact lower tree 260 of FIG. 2B, beginning with the root,
progression moves forward to a, e, c, e, a, and then the "ecae" string falls off the tree.
Thus node 288 is the last node, or r, u is the null string and v is the suffix "ecae." Any of
r, u, and v may be null strings.
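
The definition of the canonical decomposition can be exercised with a short Python
sketch. The representation below is an assumption made for illustration: a tree is given
as the set of strings associated with its nodes (the root being the empty string), and the
two small node sets are hypothetical stand-ins that merely reproduce the decompositions
described for trees 250 and 260, not the actual trees of FIG. 2B.

```python
def words(nodes):
    """All prefixes of all node strings, i.e. the set WORDS(T)."""
    return {node[:i] for node in nodes for i in range(len(node) + 1)}


def canonical_decomposition(nodes, y):
    """Return (r, u, v) with y == r + u + v, following the definition of C_T(y).

    `nodes` must contain the root, represented by the empty string.
    """
    word_set = words(nodes)
    # ru is the longest prefix of y that is a word of T.
    ru_len = max(i for i in range(len(y) + 1) if y[:i] in word_set)
    # r is the longest prefix of y that is a node of T.
    r_len = max(i for i in range(ru_len + 1) if y[:i] in nodes)
    return y[:r_len], y[r_len:ru_len], y[ru_len:]


# Hypothetical compact tree: root, node "a", and node "aeceac" reached from
# "a" through the composite edge "eceac".
COMPACT_NODES = {"", "a", "aeceac"}
# Hypothetical atomic tree: every prefix of "aecea" is a node.
ATOMIC_NODES = words({"aecea"})

if __name__ == "__main__":
    print(canonical_decomposition(COMPACT_NODES, "aeceaecae"))  # ('a', 'ecea', 'ecae')
    print(canonical_decomposition(ATOMIC_NODES, "aeceaecae"))   # ('aecea', '', 'ecae')
```

In the compact case the mismatch occurs in the middle of a composite edge, so u is
non-empty; in the atomic case every word is a node and u is the null string.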

For a given tree T, S_T represents the set of nodes s such that s has fewer than α
children, or s has a composite outgoing edge. S^$ denotes the set of strings w$ where $ is
a special symbol that does not belong to the alphabet, and w is a word of T that is not a
leaf of T. The set of states for T is defined as the union of S_T and S^$, that is,
S_T ∪ S^$.

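The function that assigns states to strings for a given tree T is known as the
"tree-state function" and is defined as s_T : A* → S_T ∪ S^$, with

\[
s_T(x^n) =
\begin{cases}
V_T(\overline{x^n}) & \text{if } \overline{x^n} \text{ is not a word of } T \text{ or is a leaf of } T, \\
\overline{x^n}\,\$ & \text{otherwise.}
\end{cases}
\tag{1}
\]

The symbol $ can be interpreted as a conceptual marker preceding the first actual symbol
of x^n.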

The first case of Equation (1) is true for sufficiently long strings, and in this case
s_T(x^n) ∈ S_T. For short strings, the second case in Equation (1) may be true, in which
case s_T(x^n) ∈ S^$. Note that only one string selects each state in S^$. These states are
called "transient states". On the other hand, arbitrarily long strings select states in S_T, and
these states are termed "permanent states".
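
A sketch of the tree-state function in Python may help fix the distinction between
transient and permanent states. The sketch rests on an assumption about the two cases:
a string whose reverse is a word of T but not a leaf is assigned the transient state
formed by that reversed string followed by $, and every other string is assigned the
longest prefix of its reverse that is a node of T. This reading follows from the
definition of the transient states as strings w$ with w a non-leaf word. The function
name tree_state and the node set used for the tree of FIG. 4 are hypothetical.

```python
def tree_state(nodes, x):
    """Sketch of s_T for a GCT given as a set of node strings (root is "")."""
    word_set = {node[:i] for node in nodes for i in range(len(node) + 1)}
    leaves = {n for n in nodes
              if not any(m != n and m.startswith(n) for m in nodes)}
    y = x[::-1]  # reversed string: most recent symbol first
    if y in word_set and y not in leaves:
        return y + "$"  # transient state, selected by exactly one string
    # Permanent state: the longest prefix of y that is a node, i.e. V_T(y).
    r_len = max(i for i in range(len(y) + 1) if y[:i] in nodes)
    return y[:r_len]


# Assumed node set of the full binary tree of FIG. 4 (leaves 0, 100, 101, 11).
FIG4_NODES = {"", "0", "1", "10", "11", "100", "101"}

if __name__ == "__main__":
    for x in ("", "1", "01", "0", "010111"):
        print(repr(x), "->", tree_state(FIG4_NODES, x))
```

Under these assumptions the empty string, "1", and "01" select the transient states "$",
"1$", and "10$", while "0" and "010111" select the permanent states "0" and "11".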

When T is a "full tree," the set of permanent states of GCT T is equal to the set of
end nodes or leaves. For the GCT of FIG. 4, for example, S_T, or the set of states of tree
T, is {0, 100, 101, 11}.

A finite state machine (FSM) over A is defined as:

\[ F = (S, f, s_0) \tag{2} \]

where S is a set of states, f : S × A → S is a next state function, and s_0, an element of S,
is the initial state. For an FSM, the state function is recursively defined by the next state
function starting from initial state s_0, or in other words the state assigned to a string x^k
is f(...f(f(s_0, x_1), x_2)..., x_k). The concept of "permanent state" is also defined for an
FSM where a state s is "permanent" if there exist arbitrarily long strings x^i such that
f(...f(f(s_0, x_1), x_2)..., x_i) equals s, or in other words x^i selects state s. Otherwise,
the state is "transient."

A GCT has the FSM property, or the tree "is FSM," if the tree T defines a next
state function f : (S_T ∪ S^$) × A → S_T ∪ S^$ such that for any sequence x^{n+1},

\[ s_T(x^{n+1}) = f(s_T(x^n), x_{n+1}) \tag{3} \]

For the binary tree of FIG. 4, the state following the transmission of a 1 at state 0
in tree 400 could either be "100" or "101." The tree therefore does not have the finite
state machine property. The system therefore needs additional past symbols to make a
conclusive determination of the state beyond the symbols provided by the length-one
context at root node 403.
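
The ambiguity can be checked by brute force. The following Python sketch is illustrative
only: the names leaf_state and successors are hypothetical, and the leaf strings assumed
for FIG. 4 are those of the state set {0, 100, 101, 11}. For every past of a fixed length
that selects a given state, it records the state selected after one more symbol.

```python
from itertools import product


def leaf_state(leaves, past):
    """Leaf of a full binary context tree selected by `past` (assumed long enough)."""
    reversed_past = past[::-1]
    return next(leaf for leaf in leaves if reversed_past.startswith(leaf))


def successors(leaves, s, a, length=5):
    """States reachable from state s on symbol a, over all pasts of the given length."""
    found = set()
    for symbols in product("01", repeat=length):
        past = "".join(symbols)
        if leaf_state(leaves, past) == s:
            found.add(leaf_state(leaves, past + a))
    return found


# Assumed leaf strings of the tree of FIG. 4.
FIG4_LEAVES = ("0", "100", "101", "11")

if __name__ == "__main__":
    print(sorted(successors(FIG4_LEAVES, "0", "1")))  # ['100', '101']
```

Two successors are found for state 0 and symbol 1, so no next state function of the
current state and next symbol alone can exist for this tree.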

One possible way of verifying whether a GCT T is FSM is by means of the
"suffix property." If, for every permanent state s, the suffix tail(s) is a node of T, then T
is FSM. In this case, the next state function f satisfies, for all a ∈ A, f(s,a) equals V_T(as),
where V_T(as) represents the first element, r, of C_T(as).
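
The suffix property and the resulting next state function lend themselves to a compact
Python sketch. Everything below is an illustrative assumption rather than the disclosed
implementation: trees are full binary trees given as sets of node strings, permanent
states are taken to be the leaves, and the node set named FIG5_NODES is a hypothetical
reading of FIG. 5 obtained by adding the two leaves "00" and "01" to the tree of FIG. 4.

```python
def leaves(nodes):
    """Nodes that are not a proper prefix of any other node."""
    return {n for n in nodes
            if not any(m != n and m.startswith(n) for m in nodes)}


def has_suffix_property(nodes, permanent_states):
    """True if tail(s) is a node of T for every (non-empty) permanent state s."""
    return all(s[1:] in nodes for s in permanent_states if s)


def next_state(nodes, s, a):
    """f(s, a) = V_T(as): the longest prefix of the string a + s that is a node."""
    y = a + s
    r_len = max(i for i in range(len(y) + 1) if y[:i] in nodes)
    return y[:r_len]


FIG4_NODES = {"", "0", "1", "10", "11", "100", "101"}
FIG5_NODES = FIG4_NODES | {"00", "01"}  # assumed FSM closure of FIG. 4

if __name__ == "__main__":
    print(has_suffix_property(FIG4_NODES, leaves(FIG4_NODES)))  # False
    print(has_suffix_property(FIG5_NODES, leaves(FIG5_NODES)))  # True
    print(next_state(FIG5_NODES, "00", "1"))  # 100
```

The check fails for the tree of FIG. 4 because tail("100") = "00" is not a node, and
succeeds once "00" and "01" are present.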

Note that the GCT 400 in FIG. 4 does not satisfy the suffix property because the
descendants of node 451 are not replicated at node 450, i.e. neither suffix "00" nor suffix
"01" is present. To make a tree T that is not FSM into a tree that is FSM, the system
must add nodes and/or edges to the tree T to ensure conformance with Equation (3).

The present design computes a GCT T_suf by taking T and adding, as nodes, all
suffixes of the nodes of T. Addition of a node may cause a composite edge, or an edge
labeled with a string of more than one symbol, to split. If, for example, w is a node of T
with an outgoing edge uv, and the construction of the suffix tree calls for adding the node
wu, the edge w → wuv is split into w → wu → wuv.
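
For intuition, the construction can be sketched naively in Python. This sketch is not the
linear-time method referenced in this description: it assumes an atomic tree given as a
prefix-closed set of node strings (so that edge splitting never arises), enumerates every
suffix of every node explicitly, and uses the hypothetical name suffix_closure.

```python
def suffix_closure(nodes):
    """Add every suffix of every node of T as a node (naive, not linear time).

    Assumes `nodes` is prefix-closed (an atomic tree); the result is then
    prefix-closed as well, since a prefix of a suffix of a node is a suffix
    of a prefix of that node.
    """
    closed = set(nodes)
    for node in nodes:
        for start in range(len(node) + 1):
            closed.add(node[start:])
    return closed


# Assumed node set of the full binary tree of FIG. 4.
FIG4_NODES = {"", "0", "1", "10", "11", "100", "101"}

if __name__ == "__main__":
    print(sorted(suffix_closure(FIG4_NODES) - FIG4_NODES))  # ['00', '01']
```

For the assumed tree of FIG. 4 exactly two nodes are added, "00" and "01", which agrees
with the two new leaves mentioned for FIG. 5.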

T_suf is a "refinement" of T, where refinement means a "refinement function" g
exists such that s_T(x) = g(s_{T_suf}(x)) for every string x. In other words, given the state
assigned by T_suf to a string x, the system can determine the state assigned by T even if x
is unknown. A GCT can be refined by or be a refinement of an FSM or another GCT. A
"minimal" refinement of a GCT T which is FSM, but is not necessarily a GCT, is called
an "FSM closure" of T, where minimal in this context indicates having a minimal number
of permanent states. T_suf is one possible FSM closure of T.

FIG. 5 illustrates a GCT T_F having FSM properties which is an FSM closure of
the tree of FIG. 4. New nodes 501 and 502 added to the tree T are shaded. FIG. 6 shows
the finite state machine associated with the tree T_F of FIG. 5. Transient states and their
transitions are indicated by dashed lines.

An efficient method for building T_suf is disclosed in U.S. Patent Application
10/768,904, entitled "FSM Closure of Generalized Tree Models," inventors Alvaro
Martin et al., which again is incorporated herein by reference. The method may operate
in a time frame linear in the sum of edge lengths over all edges of T and the number of
nodes in T_suf. The algorithm begins with a representation of T and adds the necessary
nodes and edges to construct T_suf. The algorithm also builds a structure Transitions[w]
that determines the next-state function for each permanent state w and for each transient
state w$ such that w is also a node of T_suf. An additional structure, Origin[w], associates
to each node w in T_suf the original node in T from which w descends, i.e., provides the
refinement function from T_suf into T.

A GCT T and a set of probability distributions on symbols of the alphabet
conditioned on states of T can be used as a model for a finite memory source. Such a
model is termed a Generalized Context Tree Model (GCT Model). The probability
assigned by a GCT model with tree T to a string x^n is:

P(x^n) = \prod_{i=1}^{n} P(x_i | s_T(x^{i-1}))     (4)

where P(a | s) is the probability of symbol a conditioned on the state s.
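
By way of illustration only, Equation (4) may be evaluated as in the following Python sketch. The state function last_symbol_state is a hypothetical stand-in for s_T (it simply selects the most recent symbol), and the probability table is invented for the example; neither is part of the present design.

```python
from math import prod


def gct_probability(x, state_of, cond_prob):
    """Equation (4): product over i of P(x_i | s_T(x^{i-1}))."""
    return prod(cond_prob[state_of(x[:i])][x[i]] for i in range(len(x)))


def last_symbol_state(prefix):
    """Hypothetical state function: the last symbol seen, or "" at the start."""
    return prefix[-1:]


# Invented conditional distributions, one per state.
cond_prob = {
    "":  {"0": 0.5,  "1": 0.5},
    "0": {"0": 0.75, "1": 0.25},
    "1": {"0": 0.5,  "1": 0.5},
}

# P("001") = P(0 | "") * P(0 | "0") * P(1 | "0") = 0.5 * 0.75 * 0.25
p = gct_probability("001", last_symbol_state, cond_prob)
```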
Universal Coding for the Class of Tree Models 

The present design implements the semi-predictive variant of the Context algorithm in time
linear in the length of the input string. This variant operates in two passes on the encoder
side. In a first read of the input string x^n = x, the system determines an optimum full tree
for coding x, where optimum is a relative term meaning a full tree supporting a model
that minimizes the code length. The supported model changes dynamically during the
encoding process. The system encodes the optimum full tree as a binary sequence and
sends the binary sequence to the decoder. In a second pass, the system sequentially
encodes x using an arithmetic encoder. The system computes the probabilities used by
the arithmetic encoder according to the same dynamic tree model, by calculating a
conditional probability for each symbol of the string one at a time. The probability
calculated for x_i is based on statistics of symbol occurrences kept in the state s_T(x^{i-1}) in
the optimum full tree. The total probability assigned to x is the product of the
probabilities assigned to each symbol x_i. One such sequential probability assignment the
system may use is the Krichevsky-Trofimov (KT) assignment, which assigns a
probability



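p(x_{j+1} = a) = \frac{2 n_{s,a}(x^j) + 1}{2 \sum_{a' \in A} n_{s,a'}(x^j) + \alpha}     (5)

where n_{s,a}(x^j) denotes the number of times symbol a was encoded in state s = s_T(x^j) during
the processing of x^j, A is the alphabet, and \alpha is the alphabet size.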
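
As a non-limiting sketch, the KT assignment of Equation (5) for a single state may be written in Python as follows. The function names and the use of exact fractions are choices made for this illustration only.

```python
from collections import Counter
from fractions import Fraction


def kt_probability(counts, a, alphabet):
    """Equation (5): (2 n_{s,a} + 1) / (2 sum_b n_{s,b} + alpha) for one state s."""
    n_s = sum(counts[b] for b in alphabet)
    return Fraction(2 * counts[a] + 1, 2 * n_s + len(alphabet))


def kt_sequence_probability(symbols, alphabet):
    """Probability of a run of symbols all coded in the same state."""
    counts = Counter()
    p = Fraction(1)
    for a in symbols:
        p *= kt_probability(counts, a, alphabet)
        counts[a] += 1          # statistics are updated after coding each symbol
    return p


# Coding "0", "0", "1" in one state of a binary model gives 1/2 * 3/4 * 1/6.
p = kt_sequence_probability("001", "01")
```

At every step the probabilities over the alphabet sum to one, so the assignment can drive an arithmetic coder directly.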

The decoder operates by first receiving the binary sequence describing the
optimum tree and reconstructing the tree. The system uses an arithmetic decoder to
sequentially decode each symbol of x. As the state used by the encoder to encode x_i
depends only on the past symbols x_1 ... x_{i-1} decoded before x_i, the decoder determines the
same state in the tree model and uses the same statistics as the encoder, thereby
calculating the same probability distribution to decode x_i.

Embodiments of the encoding and decoding algorithms of the present design are
illustrated in FIGs. 7-12.

Encoding 

From FIG. 7, the encoder receives the string x and, in its first pass, builds a suffix tree of the
reversed x at point 706, represented by suffix tree ST 701. The encoder
prunes the suffix tree into a GCT T', or T'(x), at point 707, shown as the top half of
pruned suffix tree triangle 702. The encoder passes T'(x) along two paths, the first being
to build an FSM closure of T'(x), denoted T'_F, shown as box 703, and the second to obtain t, a
binary representation of the full tree T'_full, the full representation of the pruned tree T'(x),
at box 704. The full tree T'_full is the optimum coding tree T(x) for x. The encoder then
encodes x into a binary string y using states in T', transitioning with finite state machine
T'_F, in the second pass shown in box 705. The encoder uses T'_F to transition efficiently
from one state to the next state, but the states that determine the set of statistics the
encoder uses for coding are still according to T', which may have fewer states than T'_F.
The output of the encoder is then ty, a binary representation t of T'_full concatenated with
binary string y.

The coding process may be performed at the encoder in two encoding passes. In
the first pass, the system computes a relatively optimum coding tree T(x) for a string x by
minimizing an appropriate cost function that accounts for the number of bits needed for
representing the optimum coding tree as a binary sequence, plus the code length for x
using a statistical model based on the full tree T(x). For encoding T(x), a representation
known to those skilled in the art as natural encoding may be used,
requiring one bit per node. The code length for x depends on the method of
assigning symbol probabilities based on statistics maintained in the tree states. The KT
assignment is one example of such an assignment. The present design computes a GCT
T'(x), or T', where T' is the GCT obtained from T(x) by deleting all nodes that did not
occur as substrings of x, as well as any node u whose outgoing degree becomes one (1)
after deleting those nodes, except if u is a prefix of x. The present design then derives
T(x) as T'_full.

The system derives T'(x) by first constructing ST(x), the suffix tree of x̄^{n-1}$ (the
reversed x^{n-1} followed by $), and
then "pruning" the suffix tree, with the possible insertion of additional nodes. Except for
these additional nodes, which may be inserted to account for the fact that incoming edges
of the leaves of T'(x) must be atomic, T'(x) is a subtree of ST(x). In a postorder traversal
of ST(x), the program compares, for each node u, the sum of the costs of the optimal subtrees
rooted at all its children with the cost of making u' a leaf, where u' is the prefix of u whose
length |u'| is equal to |PAR_{ST(x)}(u)| plus one, where PAR denotes the parent of the node in
question, or u' is equal
to the root if u is equal to the root. The program may perform this comparison by
recursively associating with each internal node visited the cost:

K(u) = \min\left( \alpha (|u| - |u'| + 1) + \sum_{w} K(w),\; k(x^n, u) \right)     (6)

where the summation is performed over all children w of u in ST(x), and
k(x^n, u) is the code length obtained for symbols of x encoded in state u using the KT
sequential probability assignment, calculated as follows:

k(x^n, u) \triangleq \log_2 \frac{\Gamma\left(n_u(x^{n-1}) + \frac{\alpha}{2}\right) \Gamma\left(\frac{1}{2}\right)^{\alpha}}{\Gamma\left(\frac{\alpha}{2}\right) \prod_{a \in A} \Gamma\left(n_{u,a}(x^n) + \frac{1}{2}\right)}     (7)

In Equation (7), n_s(x^j) denotes the number of occurrences of state s in the sequence
s_T(λ), s_T(x^1), s_T(x^2), ..., s_T(x^j), 0 ≤ j < n, Γ is the gamma function, and n_{s,a}(x^j) denotes
the number of occurrences of a at state s in x^j. The system recursively computes these
counts as the sum of the corresponding counts over all children of s. The
recursion starts from the leaves u$ of ST(x). The symbol a that follows ū (the reversed u) in x can be
recorded during the suffix tree construction and associated with the leaf u$. The cost of a
leaf of ST(x) is defined as log α. The program then marks u' as a leaf in case the minimum
is achieved by the second argument in the right-hand side of Equation (6).
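
For illustration only, the code length of Equation (7) may be computed from the per-symbol counts of a state using the log-gamma function, as in the following Python sketch. The function name kt_code_length and the dictionary representation of the counts are assumptions of this sketch; symbols absent from the dictionary are treated as having a count of zero.

```python
from math import lgamma, log


def kt_code_length(counts, alpha):
    """Equation (7): KT code length, in bits, for the symbols coded in one state.

    `counts` maps each observed symbol a to n_{u,a}; `alpha` is the alphabet size.
    """
    n_u = sum(counts.values())
    unseen = alpha - len(counts)        # symbols with zero count contribute Gamma(1/2)
    ln_num = lgamma(n_u + alpha / 2) + alpha * lgamma(0.5)
    ln_den = (lgamma(alpha / 2)
              + sum(lgamma(c + 0.5) for c in counts.values())
              + unseen * lgamma(0.5))
    return (ln_num - ln_den) / log(2)   # convert natural log to bits


# Two zeros and a one in a binary state: sequential KT probability 1/16, i.e. 4 bits.
bits = kt_code_length({"0": 2, "1": 1}, 2)
```

The closed form agrees with the product of the sequential probabilities of Equation (5), which is why the encoder can evaluate the cost of a candidate leaf from counts alone.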

A pruning function may be implemented using a function prune(u) that prunes the
subtree rooted at u and returns K(u) from Equation (6). The function prune(u) may begin
by calling recursively prune(w) for all children w of u and accumulating the returned
values in a variable, such as a variable Y. This accumulation accounts for the summation
in the first argument of Equation (6). Adding α(|u| - |u'| + 1) to Y provides the first
argument.

The second argument of Equation (6) is calculated as stated in Equation (7) and
stored in a variable, such as variable Z. Equation (6) indicates the program takes the
minimum of both arguments, so if Y is less than Z, prune(u) returns Y. If Z is less than
or equal to Y, prune(u) deletes the entire subtree rooted at u, adds u' as a leaf, and returns
Z as the cost.
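
A minimal Python sketch of prune(u), following the Y and Z description above, is given below by way of example. The Node class, its field names, the base-2 logarithm for the leaf cost, and the toy trees are assumptions of this sketch; in particular, the counts are supplied by hand here rather than accumulated from a suffix tree.

```python
from math import lgamma, log, log2


class Node:
    """Hypothetical suffix tree node: depth is |u|, counts holds n_{u,a}."""
    def __init__(self, depth, counts=None, children=()):
        self.depth = depth
        self.counts = counts or {}
        self.children = list(children)
        self.parent = None
        self.leaf_depth = None          # set to |u'| when u' is marked as a leaf
        for w in self.children:
            w.parent = self


def kt_code_length(counts, alpha):
    """Equation (7), in bits."""
    n_u = sum(counts.values())
    unseen = alpha - len(counts)
    ln_num = lgamma(n_u + alpha / 2) + alpha * lgamma(0.5)
    ln_den = (lgamma(alpha / 2) + unseen * lgamma(0.5)
              + sum(lgamma(c + 0.5) for c in counts.values()))
    return (ln_num - ln_den) / log(2)


def prune(u, alpha):
    """Return K(u) of Equation (6), pruning the subtree of u when that is cheaper."""
    if not u.children:                  # leaf of ST(x): cost log(alpha)
        return log2(alpha)
    y = sum(prune(w, alpha) for w in u.children)
    u_prime_len = 0 if u.parent is None else u.parent.depth + 1
    y += alpha * (u.depth - u_prime_len + 1)      # first argument of Equation (6)
    z = kt_code_length(u.counts, alpha)           # second argument of Equation (6)
    if y < z:
        return y
    u.children = []                     # delete the subtree rooted at u
    u.leaf_depth = u_prime_len          # record u' as a leaf
    return z


# A heavily skewed root is cheaper as a single leaf; a balanced one keeps its children.
skewed = Node(0, {"0": 8}, [Node(1), Node(1)])
mixed = Node(0, {"0": 50, "1": 50}, [Node(1), Node(1)])
cost_skewed = prune(skewed, 2)
cost_mixed = prune(mixed, 2)
```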
Once the system determines T'(x), or T', the system may compute T(x) as T'_full,
where T'_full is the tree obtained from T'(x) by adding all nodes necessary to become a full
tree. The system encodes full tree T(x) with a natural code, which can be specified
recursively with a pre-order traversal of T(x), to obtain the bitstream t, which can be sent
to the decoder.
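
One possible realization of such a natural code is sketched below in Python, purely as an example. The nested-list tree representation and the convention that "1" marks an internal node and "0" marks a leaf are assumptions of this sketch; the description above only requires one bit per node in pre-order.

```python
def natural_encode(node):
    """Pre-order natural code of a full tree: one bit per node.

    A node is a list of its children; a leaf is the empty list.
    Assumed convention: "1" for an internal node, "0" for a leaf.
    """
    if not node:
        return "0"
    return "1" + "".join(natural_encode(child) for child in node)


def natural_decode(bits, alpha):
    """Rebuild the full alpha-ary tree from its natural code."""
    stream = iter(bits)

    def read():
        if next(stream) == "0":
            return []
        return [read() for _ in range(alpha)]

    return read()


# Full binary tree with 5 nodes: root -> (leaf, internal -> (leaf, leaf)).
tree = [[], [[], []]]
t = natural_encode(tree)
rebuilt = natural_decode(t, 2)
```

Because the tree is full, the decoder knows that every internal node has exactly alpha children, so no further structural information is needed.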

The second phase begins by building T'_F(x), an FSM closure of T'(x). For every
permanent (non-transient) state w of T'_F(x), and every symbol c, the encoder has access
to the next state transition f(w,c) via the data array Transitions[w]. The Transitions data
array also provides state transitions for all transient states associated with nodes. Further,
the link Origin[w] points to the state of T(x) being refined by w, which accumulates all
statistics for the associated probability assignment. Equipped with these structures, the
encoder 101 proceeds to encode the symbols of x sequentially, requiring a constant bounded
time for each symbol, using an arithmetic encoder.
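
The per-symbol work of the second phase may be sketched in Python as follows, by way of example only. The arithmetic encoder is replaced here by an accumulation of the ideal code length (the negative base-2 logarithm of each KT probability), and the toy three-state machine, the dictionary form of Transitions and Origin, and the function name are all assumptions of this sketch.

```python
from collections import Counter, defaultdict
from math import log2


def second_pass_code_length(x, transitions, origin, alphabet, root=""):
    """Ideal code length, in bits, of x under KT statistics kept in Origin[state]."""
    stats = defaultdict(Counter)        # one set of counts per original state
    alpha = len(alphabet)
    state, bits = root, 0.0
    for c in x:
        counts = stats[origin[state]]   # statistics live in the state being refined
        n = sum(counts.values())
        p = (2 * counts[c] + 1) / (2 * n + alpha)   # Equation (5)
        bits -= log2(p)                 # an arithmetic coder would consume p here
        counts[c] += 1
        state = transitions[state][c]   # constant-time next-state lookup
    return bits


# Toy FSM whose state is simply the last symbol; every state is its own origin.
transitions = {s: {"0": "0", "1": "1"} for s in ("", "0", "1")}
origin = {"": "", "0": "0", "1": "1"}
bits = second_pass_code_length("001", transitions, origin, "01")
```

Each iteration performs one lookup in Origin, one KT evaluation, and one lookup in Transitions, which is the constant bounded work per symbol referred to above.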

Decoding 

One implementation of a decoder that has access to the GCT T'(x) is a decoder
operating in a manner analogous to the second encoding phase described above. After
building T'_F(x), such a decoder would mimic the encoder 101 by assigning the same
probability to x_i based on both T'(x) and x^{i-1}, sequentially available to the decoder as
previous symbols are decoded, where an arithmetic decoder would decode x_i. However,
the bitstream t made available from the encoder 101 only describes T(x) to the decoder
102, not T'(x). One implementation of the decoder 102 could apply the same procedure
as a decoder knowing T'(x), using T(x) instead of T'(x) to recover x. One problem with
such an approach is that the FSM closure of T(x) may produce a relatively large number
of nodes, which adversely affects the complexity of the system. Another approach
enables the encoder 101 to transmit T'(x), instead of T(x), through the bitstream t.
However, the description of T'(x) increases the number of bits in t since T'(x) may not be
a full tree. One way of describing T'(x) is by sending an additional bit for each node of
T(x) specifying whether each node is a node of T'(x). In this implementation 2|t| bits
suffice to describe T'(x). Such an encoder would produce a code length which retains
the asymptotic optimality properties of the code, but which is larger than necessary.

Rather than transmitting T'(x), which increases the number of coding bits, or building
the FSM closure of T(x), which increases the complexity of the system, three solutions may be
employed to provide the information at the decoder at low complexity and without extra
bits: decoding using an incomplete FSM closure, decoding using an incremental FSM
closure construction, and decoding using incremental suffix tree construction. Decoding
using an incomplete FSM closure computes the FSM closure of a tree containing a subset
of the nodes of T'(x). The decoder determines the state in T(x) using this subset of
nodes, requiring additional processing after each FSM state transition, thereby decoding
the string x using the same statistics as the encoder. Decoding using incremental FSM
closure construction operates in a similar manner to decoding using an incomplete FSM
closure, but rather than recalculating the state in T(x) corresponding to a certain state s in
the FSM each time s is visited, the decoder expands the FSM closure to accommodate
new states, and the expanded closure eventually becomes equal to T'_F(x), the FSM closure
of T'(x) employed
by the encoder. The third decoding alternative, decoding using incremental suffix tree
construction, does not use an FSM closure. Instead, the decoder expands T(x) by adding
all suffixes of x sequentially as symbols are decoded. The decoder can determine the
state in T(x), using the same statistics as the encoder, by maintaining the nearest ancestor
originally existing in T(x) for each added node.

In all of these cases, decoding occurs by loop processing one symbol at a time
from the string, beginning at symbol x_1 and processing up to x_n. As used herein, the
subscript i denotes instant values of variables at the time i for symbols already decoded. For
example, if s denotes a state on a context tree, s_i denotes the value of s before decoding
symbol x_{i+1}.

FIG. 8 illustrates a conceptual flow diagram of an embodiment of the decoder
building an incomplete FSM closure. The decoder receives ty from the encoder, and
builds a tree T(x) from the t string received at point 801. The decoder then builds T̂'(x),
a tree obtained from T(x) by deleting all leaves and also all nodes whose number of
children after deleting leaves is 1, at point 802. A mapping to T(x), known as a Jump
structure, may also be built at this point. A sample Jump structure, explained below, is
illustrated in FIG. 11. The decoder then builds an FSM closure of T̂'(x), updating the
mapping structure Jump when appropriate, at point 803. The result is T̂'_F(x), which
passes into a loop beginning with point 804. The present design links each state ŝ_i to s_i,
where s_i is the state required by decoder 102 and ŝ_i is the available state in the FSM
T̂'_F(x). The link is given by defining ẑ_i to be such that ŝ_i ẑ_i is the longest prefix of x̄^i
that is a word of T̂'_F(x), where the $ symbols were removed from transient states.
Further, b̂_i is defined as x_{i-|ŝ_i ẑ_i|} if |ŝ_i ẑ_i| is less than i, and λ otherwise. Based on these
definitions, the state s_i can be determined as s_i = ŝ_i ẑ_i b̂_i.

Point 804 determines state s in T(x) based on state ŝ in T̂'_F(x), ẑ, and b̂ using
the Jump mapping structure. Point 805 decodes the symbol using statistics corresponding
to state s. Point 806 transitions to the next state ŝ and updates ẑ and b̂. Point 806 cycles
back to point 804 and these last three points repeat until all symbols in x have been
determined by the decoder 102.

The Jump structure of FIG. 11 illustrates the data structure Jump[u], whose goal
is to facilitate determination of s_i in constant time per input symbol, without revisiting the
decoded sequence for all the symbols in ẑ_i. For every internal node u of T(x) that is also
a node of T̂'(x), A_u denotes the set of symbols for which u has an edge of T(x) in the
direction of the symbol, and u(a) denotes the edge of T(x) in the direction of a ∈ A_u.
For every j, v_{u,a}(j) denotes the node of T(x) obtained by concatenating the first j symbols
of the label of the edge u(a) to u. The data structure Jump[u] links u with each v_{u,a}(j) and
may be produced by the system in linear time for all relevant nodes, such as by setting up
the Jump data structure for the nodes of T̂'(x) and updating the data structure as edges of
T̂'(x) are split by the algorithm.

FIG. 12 represents an embodiment of the decoder implementation that builds an
incomplete FSM closure using T(x) received from the encoder. From FIG. 12, line 1 sets
T̂'(x) equal to a compacted version of T(x) minus the leaves of T(x), where "compact"
simply indicates eliminating nodes with exactly one child. Reassignment or removal of
these intermediate nodes in the tree provides a compacted tree structure that can be FSM
closed with relatively low complexity. Line 2 of FIG. 12 computes T̂'_F(x), an FSM
closure of T̂'(x). Line 3 sets ŝ, a pointer, to point to the root node of the FSM closed
tree T̂'_F(x). Line 4 sets the variable zlength to 0 and b̂ equal to the root node (the empty
string λ), where zlength is a string length and b̂ is a symbol or λ.
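
The tree construction of line 1 (deleting the leaves of T(x) and then compacting) may be sketched in Python as follows, as an example only. Representing the full tree as a set of node strings, the function name, and the choice to always retain the root are assumptions of this sketch.

```python
def compact_without_leaves(nodes, alphabet):
    """Delete the leaves of a full tree, then drop nodes left with exactly one child.

    `nodes` is the set of node strings of the full tree T(x); the root is "".
    Assumption of this sketch: the root is always retained.
    """
    nodes = set(nodes)
    # Internal nodes of T(x) are the nodes that survive the deletion of leaves.
    internal = {u for u in nodes if any(u + a in nodes for a in alphabet)}

    def children_after_deletion(u):
        return sum(u + a in internal for a in alphabet)

    return {u for u in internal if u == "" or children_after_deletion(u) != 1}


# Full binary tree with internal nodes "", "0", "00" and leaves "1", "01", "000", "001".
t_x = {"", "0", "1", "00", "01", "000", "001"}
compacted = compact_without_leaves(t_x, "01")
```

In this toy case node "0" is left with the single child "00" after the leaves are deleted, so it is removed and the result has one composite edge from the root to "00".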

The present design recursively determines zlength, which has the value of |ẑ_i| for
each iteration, starting with ẑ_0 equal to λ, by checking decoded symbols and descending
T̂'_F(x) starting from ŝ_i. Only up to |u_i| + 1 symbols may be checked, where u_i is defined
as the string satisfying x_i ŝ_{i-1} = ŝ_i u_i. If at some point the concatenated string is not a
word of T̂'_F(x), zlength is the current length. Otherwise, zlength is equal to |u_i| + |ẑ_{i-1}|.

Line 5 begins a While loop that operates until the end of the input received from
the encoder, and line 6 begins an if statement determining whether zlength is greater than
zero. If zlength is greater than zero, line 7 determines head(ẑ) using symbols that have
been decoded. At line 8, if pointer ŝ is an internal node of T(x) and head(ẑ) is a member
of A_ŝ, the set of symbols for which ŝ has an edge in the direction of the symbol, the
subroutine sets s equal to ŝẑb̂. Line 9 begins a series of elseif commands providing
alternatives to line 8, where line 9 checks if pointer ŝ is an internal node of T(x) and
head(ẑ) is not a member of A_ŝ. In this case, s is set to ŝ head(ẑ). Line 10 evaluates
whether ŝ is a leaf of T(x), and if so, sets s equal to ŝ. Line 11 is the default, setting s
equal to Origin[ŝ]. If zlength is equal to zero, lines 14 and 15 evaluate whether ŝ is a
node of T(x), and if so, set s equal to ŝ; otherwise, they set s equal to Origin[ŝ]. At line 17,
the subroutine decodes the next symbol using statistics conditioned on s, then updates the
statistics in s, sets ŝ to the next state in T̂'_F(x) according to the decoded symbol, and
updates the values of zlength and b̂. Line 21 ends the While loop begun at line 5.

FIG. 9 illustrates a conceptual flow diagram of an embodiment of the decoder that
builds an incremental FSM closure. Again, the decoder receives ty from the encoder. At point
901, the decoder builds the tree T(x) from the bitstream t received. At point 902, the
decoder builds T̂'(x), and at point 903 builds an FSM closure of T̂'(x). This FSM closure
is represented as T̃_F. T̃_F passes to the loop beginning at point 904, which determines
the state s in T(x) based on state s̃ in T̃_F, z̃, and b̃. Here z̃ and b̃ are similar to ẑ and b̂
defined above, but for the FSM closure T̃_F. The value of z̃ for each i, z̃_i, is such that
s̃_i z̃_i is the longest prefix of x̄^i that is a word of T̃_F. Further, b̃_i is defined as x_{i-|s̃_i z̃_i|} if
|s̃_i z̃_i| is less than i and s̃_i z̃_i x_{i-|s̃_i z̃_i|} is a node of the FSM closure of T(x), T_F(x), and λ
otherwise. Based on these definitions, the state s_i is equal to s̃_i z̃_i b̃_i. Point 905 adds s and
its suffixes to T̃_F. Point 906 decodes the symbol using the statistics in Origin[s], while
point 907 transitions to the next state s̃ in T̃_F. These final four points repeat within the
decoder until all states in T(x) are determined. The decoder then provides x based on
these states.

FIG. 13 illustrates a subroutine for decoding using an incremental FSM closure.
Line 1 sets T̂'(x) equal to a compacted version of T(x) minus the leaves of T(x), similar
to line 1 of FIG. 12. Line 2 of FIG. 13 computes T̃_F, an FSM closure of T̂'(x). Line 3
sets s̃, a pointer, to point to the root node of the FSM closed tree T̃_F. Line 4 sets the
variable z̃length to 0 and b̃ equal to the root node. Line 5 begins a While loop that
operates until the end of the input received from the encoder, and line 6 begins an if
statement determining whether z̃length is greater than zero. If z̃length is greater than
zero, line 7 determines head(z̃) using the symbols already decoded. Line 8 creates node
s̃z̃ by splitting the edge departing from s̃ whose label has the first symbol
head(z̃). Line 9 sets r equal to s̃z̃, and line 10 sets the Transitions[r] array equal to the
Transitions array for s̃, where the initial structure Transitions[w] was generated by the
algorithm that built the FSM closure and determines the next-state function for each
permanent state w and for each transient state w$ such that w is also a node of the FSM
closure. Line 11 verifies r, where the Verify* subroutine is employed at this point with
argument r.

Verify* is identical to the Verify routine discussed in detail in U.S. Patent
Application 10/768,904, entitled "FSM Closure of Generalized Tree Models," inventors
Alvaro Martin et al., which is again incorporated herein by reference, except for an
update to the Origin data array and to incoming and outgoing transitions during node
insertions. For a node uv created in T̃_F as a leaf child of u, the system sets Origin[uv] to
uv if u is an internal node of T(x), or to Origin[u] otherwise. If the system creates uv by
splitting an edge u -> uvy into u -> uv -> uvy, the system sets Origin[uv] to uv
if uv is an internal node of T(x), or to Origin[uvy] otherwise. The result is that for any
leaf u of T(x), Origin[uv] is equal to Origin[uw] for all nodes uv and uw of T̃_F. If c is
equal to head(w), Verify*(w) sets f(y,c) equal to w in Transitions[y] for node y equal to
tail(w) and also for all descendants yt of y that share the same transition value, i.e., for
which f(yt,c) equals f(y,c). In
addition, if the system creates y as a new child of t, the Verify* routine sets f(y,c') equal
to f(t,c') in Transitions[y] for all c' that are elements of A and different from c.
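
The two Origin update rules stated above may be sketched in Python as follows, for illustration only. The helper names, the dictionary form of Origin, and the set internal_in_T holding the internal nodes of T(x) are assumptions of this sketch and are not part of the Verify* routine itself.

```python
def set_origin_for_new_leaf(origin, internal_in_T, u, uv):
    """Node uv was created as a leaf child of u."""
    origin[uv] = uv if u in internal_in_T else origin[u]


def set_origin_for_split(origin, internal_in_T, uv, uvy):
    """Node uv was created by splitting the edge u -> uvy into u -> uv -> uvy."""
    origin[uv] = uv if uv in internal_in_T else origin[uvy]


# Toy T(x): internal nodes "" and "0"; "1" is a leaf of T(x).
internal_in_T = {"", "0"}
origin = {"": "", "0": "0", "1": "1"}

set_origin_for_new_leaf(origin, internal_in_T, "1", "101")   # inherits Origin["1"]
set_origin_for_new_leaf(origin, internal_in_T, "1", "11")    # inherits Origin["1"]
set_origin_for_split(origin, internal_in_T, "10", "101")     # inherits Origin["101"]
set_origin_for_new_leaf(origin, internal_in_T, "0", "00")    # parent is internal in T(x)
```

In the toy run every node added below the leaf "1" of T(x) maps back to "1", so all such nodes share one set of statistics, consistent with the stated result.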

From FIG. 13, the decoder then proceeds by evaluating whether z̃length is equal
to zero, and if so, sets r equal to s̃ at line 13. If b̃ is not a zero length string, line 16
creates the node rb̃, line 17 sets Transitions of rb̃ equal to Transitions for r, and line 18
calls the Verify* subroutine with the rb̃ argument. Line 19 sets s equal to rb̃. If b̃ is a
zero length string, s is set equal to r. Line 23 decodes the next symbol using the statistics
in Origin[s], line 24 updates the statistics in Origin[s], line 25 sets s̃ to the next state in
T̃_F according to the decoded symbol, and line 26 updates the values of z̃length and b̃.
As with the previous zlength, the present design recursively determines z̃length, or |z̃_i|,
starting with z̃_0 equal to λ, by checking decoded symbols and descending T̃_F starting
from s̃_i. The decoder may check up to |ũ_i| + 1 symbols, where ũ_i is defined as equal to λ if
|s̃_i| > |s̃_{i-1}| + 1, or as the string satisfying x_i s̃_{i-1} = s̃_i ũ_i otherwise. If at some point the
concatenated string is not a word of T̃_F, z̃length is the current length. Otherwise,
z̃length is equal to |ũ_i| + |z̃_{i-1}|. Line 27 ends the While loop.

Determining whether a node u is an internal node of T_F(x) may employ an extra
Boolean flag Internal[u], computed incrementally for every node of T̃_F. Internal[u] is
initially true for all nodes, and the system sets Internal[u] false for all nodes added to
build T̃_F, except those nodes created by splitting an edge u -> uvy into
u -> uv -> uvy where Internal[uvy] is true. A similar technique may be used to
determine whether a node is an internal node of T(x). Further, transition propagations in
Verify* may employ a representation of Transitions[u] that groups all transitions of all
nodes leading to the same destination in one location. All nodes uy having a transition
f(uy, a) equal to au maintain a pointer Transition[uy, a] to a cell that points to au. The
system updates all descendants sharing the same transition by changing the value of the
cell Transition[uy, a] and creating a new value for all nodes in the path between u and uy,
excluding uy.

FIG. 10 illustrates a conceptual flow diagram of an embodiment of the decoder
using suffix tree construction. The decoder receives ty, and builds tree T(x) from t at
point 1001. The decoder then sets x to the empty string at point 1002. Point 1003 places
the reversed x, now an empty string, in T_i(x), where T_i(x) denotes a successively updated
tree initialized as T_0(x), originally equal to T(x). Point 1004 decodes the symbol using
statistics in the longest ancestor of the reversed x that was originally in T(x). Point 1005
appends the decoded symbol to x, and passes the new value of x to point 1003. Points
1003 through 1005 repeat until all symbols have been decoded.
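
The state selection of point 1004 may be illustrated by the following naive Python sketch, offered as an example only. It rescans the reversed decoded string for every symbol and therefore does not achieve the constant time per symbol of the design described herein; the function name and the set-of-strings representation of T(x) are assumptions of this sketch.

```python
def context_state(tree_nodes, decoded):
    """Longest prefix of the reversed decoded string that is a node of T(x).

    `tree_nodes` is the set of node strings of T(x), with "" as the root.
    This rescan costs time proportional to the tree depth for each symbol.
    """
    reversed_x = decoded[::-1]
    state = ""
    for k in range(1, len(reversed_x) + 1):
        if reversed_x[:k] not in tree_nodes:
            break
        state = reversed_x[:k]
    return state


# Toy T(x): a full binary tree with leaves "1", "00", and "01".
t_x = {"", "0", "1", "00", "01"}
state = context_state(t_x, "110")    # reversed string is "011"
```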

FIG. 14 illustrates a subroutine for decoding using incremental suffix tree
construction. The algorithm employs short-cut links for each node and each symbol of
the tree. The short-cut link for a node u and a symbol a points to node auv, where v is the
shortest string such that auv is an element of T(x), or is undefined if no such node exists.
The algorithm starts by initializing short-cut links for all nodes in the initial tree T(x) and
incrementally updates the structure as the system inserts additional nodes. Common
statistics are accessed by keeping, in each inserted node, a pointer to the nearest ancestor
originally in T(x). While constructing T_i(x), the system searches for the longest prefix
r'_i of the suffix being inserted, x̄^i, that is a word of the previous tree, T_{i-1}(x). This
longest prefix is the insertion point for the new suffix x̄^i. r_i is the longest prefix of x̄^i
that is a substring of x̄^{i-1}. r_i is a prefix of x_i r_{i-1}, x_i being a symbol, and r'_i is equal to r_i u_i
for some string u_i.

Construction of T_i(x) starts from node r'_{i-1}, where r'_0 is λ, and traverses the tree
upward until it reaches the first ancestor of r'_{i-1} that has a short-cut link defined for
symbol x_i or the root is reached. This is called an upward traversal; v_i denotes the
node found in the upward traversal and w_i the node pointed to by the corresponding
short-cut link, or the root if no short-cut link exists. The algorithm also performs a downward
traversal, from w_i to r'_i, comparing symbols in the path with previously decoded symbols.
Once the algorithm has determined r'_i, the algorithm adds a new leaf representing x̄^i,
and assigns short-cut links to all nodes in the unsuccessful portion of the upward
traversal pointing to the new leaf for symbol x_i.
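
The upward traversal just described may be sketched in Python as follows, by way of example. The dictionaries parent and shortcut, keyed by node strings and (node, symbol) pairs respectively, and the function name are assumptions of this sketch.

```python
def upward_traversal(r_prime, symbol, parent, shortcut, root=""):
    """Climb from r' until a node with a short-cut link for `symbol`, or the root.

    Returns (v, w): the node v where the climb stopped and the node w reached
    through its short-cut link, or the root if v has no such link.
    """
    v = r_prime
    while v != root and (v, symbol) not in shortcut:
        v = parent[v]
    w = shortcut.get((v, symbol), root)
    return v, w


# Toy tree: root "" -> "0" -> "01"; only node "0" has a short-cut link for symbol "1".
parent = {"0": "", "01": "0"}
shortcut = {("0", "1"): "10"}

v, w = upward_traversal("01", "1", parent, shortcut)
v_none, w_none = upward_traversal("01", "0", parent, shortcut)
```

The nodes visited before v, here "01" in the first call, form the unsuccessful portion of the traversal and are the ones that later receive short-cut links to the new leaf.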

As shown in FIG. 14, line 1 initializes short-cut links for T(x). Line 2 sets
variables r' and s equal to the root node. Line 3 begins a While loop that evaluates all
input received. Line 4 decodes x_i using the statistics in s, and the upward traversal begins at line 6
et seq. by setting v equal to r'. While v is not the root node and v has no short-cut link
for x_i, line 8 sets v equal to the parent of v. Line 9 ends the While loop begun at line 7.
At line 10, if v has a short-cut link for x_i, line 11 sets w equal to the node pointed to by
the short-cut link of v for x_i. Otherwise, w is set equal to the root at line 13. If the length
of string w is greater than the length of string v plus one, line 16 splits the edge from the
parent of w to w by inserting the node x_i v. Line 17 sets r_new equal to x_i v, while line 18
sets u equal to v. While the short-cut link of u for x_i is equal to w, line 20 sets the short-cut link
of u for x_i to point to r_new. Line 21 checks whether u is not equal to the root, and if so,
sets u equal to the parent of u. If the length of string w is not greater than the length of
string v plus one, the program executes a downward traversal as shown by the comment
at line 24. The algorithm employs a data structure jump[v_i], where the algorithm
constructs the jump data structure during each downward traversal to map the symbol a
equal to x_i and an index j to the jth traversed node in constant time. This
structure is a mapping between substrings y of a composite edge departing from v_i into
nodes x_i v_i y, and may be updated whenever a composite edge is split. In the algorithm, if
jump[v] defines a mapping for x_i, the algorithm at line 26 sets r_new equal to the last
entry of jump[v] for x_i. If jump[v] does not define a mapping for x_i, r_new is set equal to
w, and the program sets j equal to |r_new|. While i minus j is greater than zero, and r_new has
a child in the direction of x_{i-j}, the program sets r_new equal to the child of r_new in the direction
of x_{i-j}. Line 32 updates jump[v]. Line 33 increments j, and the program ends the While
loop and If statements from lines 30, 25, and 15. Line 37 adds a child to r_new representing
the suffix x̄^i. For all nodes in the path from r' to v, the program sets a short-cut link for
symbol x_i pointing to the new node x̄^i. r' is set to r_new, and s is set to the longest prefix
of x̄^i that was originally in T(x). Line 43 ends the While loop begun at line 3.



It will be appreciated by those of skill in the art that the present design may be
applied to other systems that employ predictive coding of information sources,
particularly those using data structures for finite memory sources that may be beneficially
universally coded, transmitted, and decoded and having enhanced predictive capability
and minimal loss of quality. In particular, it will be appreciated that various universal
coding schemes using tree models and/or finite state machines may be addressed by the
functionality and associated aspects described herein.

Although there has been hereinabove described a method for data compression
based on tree models, for the purpose of illustrating the manner in which the invention
may be used to advantage, it should be appreciated that the invention is not limited
thereto. Accordingly, any and all modifications, variations, or equivalent arrangements
which may occur to those skilled in the art should be considered to be within the scope
of the present invention as defined in the appended claims.



